Deterministic Policy Gradient Algorithms: Supplementary Material

Abstract

A. Regularity Conditions

Within the text we have referred to regularity conditions on the MDP.

Regularity conditions A.1: $p(s'|s,a)$, $\nabla_a p(s'|s,a)$, $\mu_\theta(s)$, $\nabla_\theta \mu_\theta(s)$, $r(s,a)$, $\nabla_a r(s,a)$ and $p_1(s)$ are continuous in all parameters and variables $s$, $a$, $s'$ and $\theta$.

Regularity conditions A.2: there exist $b$ and $L$ such that $\sup_s p_1(s) < b$, $\sup_{a,s,s'} p(s'|s,a) < b$, $\sup_{a,s} r(s,a) < b$, $\sup_{a,s,s'} \|\nabla_a p(s'|s,a)\| < L$, and $\sup_{a,s} \|\nabla_a r(s,a)\| < L$.

B. Proof of Theorem 1

Proof of Theorem 1. The proof follows along the same lines as the standard stochastic policy gradient theorem in Sutton et al. (1999). Note that the regularity conditions A.1 imply that $V^{\mu_\theta}(s)$ and $\nabla_\theta V^{\mu_\theta}(s)$ are continuous functions of $\theta$ and $s$, and the compactness of $\mathcal{S}$ further implies that, for any $\theta$, $\|\nabla_\theta V^{\mu_\theta}(s)\|$, $\|\nabla_a Q^{\mu_\theta}(s,a)|_{a=\mu_\theta(s)}\|$ and $\|\nabla_\theta \mu_\theta(s)\|$ are bounded functions of $s$. These conditions are needed to exchange derivatives and integrals, and the order of integration, whenever necessary in the following proof. We have, ...
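The equations that follow "We have," are cut off in this excerpt. For reference, the result being established, stated as in the main paper (Silver et al., 2014), is the deterministic policy gradient theorem:

\[
\nabla_\theta J(\mu_\theta)
= \int_{\mathcal{S}} \rho^{\mu_\theta}(s)\, \nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu_\theta}(s,a)\big|_{a=\mu_\theta(s)} \, \mathrm{d}s
= \mathbb{E}_{s\sim\rho^{\mu_\theta}}\!\left[ \nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu_\theta}(s,a)\big|_{a=\mu_\theta(s)} \right],
\]

where $\rho^{\mu_\theta}$ is the discounted state distribution induced by $\mu_\theta$ and $J(\mu_\theta)$ is the expected discounted return.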

Similar articles

Deterministic Policy Gradient Algorithms

In this paper we consider deterministic policy gradient algorithms for reinforcement learning with continuous actions. The deterministic policy gradient has a particularly appealing form: it is the expected gradient of the action-value function. This simple form means that the deterministic policy gradient can be estimated much more efficiently than the usual stochastic policy gradient. To ensu...
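To make the "expected gradient of the action-value function" form above concrete, here is a minimal, hypothetical NumPy sketch (not code from the paper): it assumes a linear deterministic policy $\mu_\theta(s) = \theta^\top s$ and a made-up differentiable critic $Q(s,a)$, and estimates $\mathbb{E}_s[\nabla_\theta \mu_\theta(s)\,\nabla_a Q(s,a)|_{a=\mu_\theta(s)}]$ by Monte Carlo.

```python
# Minimal sketch, not the paper's implementation: the linear policy, the
# quadratic critic, and the state distribution below are illustrative stand-ins.
import numpy as np

rng = np.random.default_rng(0)
state_dim, action_dim = 3, 2
theta = rng.normal(size=(state_dim, action_dim))   # policy parameters (hypothetical)

def mu(theta, s):
    """Deterministic policy: a = theta^T s (illustrative linear policy)."""
    return theta.T @ s

def grad_a_Q(s, a):
    """Action-gradient of an illustrative quadratic critic Q(s, a) = -0.5*||a - W s||^2."""
    W = np.array([[1.0, 0.0, 0.5],
                  [0.0, 1.0, -0.5]])
    return -(a - W @ s)

# Monte Carlo estimate of E_s[ grad_theta mu_theta(s) * grad_a Q(s, a)|_{a=mu_theta(s)} ].
# For a = theta^T s, the chain rule gives a gradient of shape (state_dim, action_dim),
# namely outer(s, grad_a Q).
grad_estimate = np.zeros_like(theta)
n_samples = 1000
for _ in range(n_samples):
    s = rng.normal(size=state_dim)            # stand-in for sampling states
    a_grad = grad_a_Q(s, mu(theta, s))        # critic's action-gradient at a = mu_theta(s)
    grad_estimate += np.outer(s, a_grad)
grad_estimate /= n_samples

theta = theta + 0.1 * grad_estimate           # one gradient-ascent step on the objective
```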


Revisiting stochastic off-policy action-value gradients

Off-policy stochastic actor-critic methods rely on approximating the stochastic policy gradient in order to derive an optimal policy. One may also derive the optimal policy by approximating the action-value gradient. The use of action-value gradients is desirable as policy improvement occurs along the direction of steepest ascent. This has been studied extensively within the context of natural ...


Supplementary Material: Proximal Deep Structured Models

In this supplementary material we first show the analogy between other proximal methods and our proposed deep structured model, including the proximal gradient method and the alternating direction method of multipliers. After that, we provide more quantitative results on the three experiments. 1 More Proximal Algorithms Examples Let us consider the problem we defined in Eq. 1 in our main submission. ...
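The proximal gradient method mentioned in this abstract can be illustrated with a standard example; the following is a hedged sketch (not taken from that supplementary material) of an ISTA-style proximal gradient iteration for an L1-regularized least-squares problem with a made-up problem instance.

```python
# Illustrative sketch of the proximal gradient method (ISTA) for
# min_x 0.5*||A x - b||^2 + lam*||x||_1; A, b and lam are made up here.
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau*||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def proximal_gradient(A, b, lam, n_iters=200):
    x = np.zeros(A.shape[1])
    step = 1.0 / np.linalg.norm(A, 2) ** 2               # 1/L, with L the gradient's Lipschitz constant
    for _ in range(n_iters):
        grad = A.T @ (A @ x - b)                          # gradient of the smooth term
        x = soft_threshold(x - step * grad, step * lam)   # gradient step followed by prox step
    return x

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 10))
b = rng.normal(size=20)
x_hat = proximal_gradient(A, b, lam=0.1)
```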


Multi-objective Reinforcement Learning with Continuous Pareto Frontier Approximation Supplementary Material

This paper is about learning a continuous approximation of the Pareto frontier in Multi-Objective Markov Decision Problems (MOMDPs). We propose a policy-based approach that exploits gradient information to generate solutions close to the Pareto-optimal ones. Unlike previous policy-gradient multi-objective algorithms, where n optimization routines are used to obtain n solutions, our approach perf...


Publication date: 2014